An Introduction to Python Dashboards

Marc Dotson

2025-02-28

  • Provide some guidance for the datathon
  • Focus on Python beginners
  • Assume you’ve installed Python (and an IDE)
  • Focus is on three specific data tools
  • Special focus on human-readable code
  • Need help? Raise your hand or talk after

Polars

  • New data wrangling library
  • Alternative to Pandas for using DataFrames
  • Fast – in Rust, uses Apache Arrow, built to parallelize and use GPUs, allows for lazy evaluation
  • More consistent syntax than Pandas
  • Anagram of its query engine (OLAP) and Rust (rs)

Import packages and read data

Install with pip install polars, we can import Polars and read in data.

import polars as pl
import os

customer_data = pl.read_csv(os.path.join('data', 'customer_data.csv'))
customer_data.shape
customer_data.columns

Filter Observations

Polars DataFrames have methods that mirror SQL. DataFrames are still composed of columns called Series, however, unlike Pandas DataFrames, Polars DataFrames don’t have a row index (so no need to .reset_index()).

Additionally, instead of relying on Pandas’ .loc[] property, Polars includes function-like expressions like pl.col('column_name').

customer_data.filter(pl.col('college_degree') == 'Yes')
customer_data.filter(pl.col('region') != 'West')
customer_data.filter(pl.col('gender') != 'Female', pl.col('income') > 70000)
customer_data_pd.loc[customer_data_pd['college_degree'] == 'Yes']
customer_data_pd.loc[customer_data_pd['region'] != 'West']
customer_data_pd.loc[(customer_data_pd['gender'] != 'Female') & (customer_data_pd['income'] > 70000)]
filter(customer_data, college_degree == "Yes")
filter(customer_data, region != "West")
filter(customer_data, gender == "Female", income > 70000)

Slice Observations

The parameters for Polars’ .slice() method are the start index and the length.

We similarly don’t need to rely on Pandas’ .iloc[] property.

customer_data.slice(0, 5)
customer_data_pd.iloc[0:5]
slice(customer_data, 1:5)

Sort Observations

To reiterate, instead of using [ ] like in Pandas, in Polars we rely on expressions like pl.col().

customer_data.sort(pl.col('birth_year'))
customer_data.sort(pl.col('birth_year'), descending = True)
customer_data_pd.sort_values('birth_year')
customer_data_pd.sort_values('birth_year', ascending = False)
arrange(customer_data, birth_year)
arrange(customer_data, desc(birth_year))

Select Variables

Polars separates out .filter() and .select() that are combined in Pandas’ .loc[].

customer_data.select(pl.col('region'), pl.col('review_text'))
customer_data.select(pl.col(['region', 'review_text']))
customer_data_pd.loc[:, ['region', 'review_text']]
select(customer_data, region, review_text)

Add Variables

Polars is actually a query language, like SQL. So it’s not surprising to see methods with names that more closely mirror queries, like the .with_columns() method.

customer_data.with_columns(income_new = pl.col('income') / 1000)
customer_data_pd.assign(income_new = customer_data_pd['income'] / 1000)
mutate(customer_data, income = income / 1000)

Join Data Frames

Joins are straightforward.

store_transactions = pl.read_csv('data/store_transactions.csv')
store_transactions

customer_data.join(store_transactions, on = 'customer_id', how = 'left')
store_transactions_pd = pd.read_csv('data/store_transactions.csv')
store_transactions_pd

customer_data_pd.set_index('customer_id').join(store_transactions_pd.set_index('customer_id'))
store_transactions <- read_csv("store_transactions.csv")
store_transactions

left_join(customer_data, store_transactions, join_by(customer_id))

Consecutive Lines of Code

While possible with Python code generally, Polars especially embraces writing consecutive lines of code using method chaining. Note that:

  • The entire chain needs to be surrounded with ( )
  • Each line starts with .
  • You run the whole block of code at once
(customer_data
  .join(store_transactions, on='customer_id', how='left')
  .filter(pl.col('region') == 'West', pl.col('feb_2005') == pl.col('feb_2005').max())
  .with_columns(age = 2024 - pl.col('birth_year'))
  .select(pl.col(['age', 'feb_2005']))
  .sort(pl.col('age'), descending=True)
  .slice(0, 1)
)
crm_data_pd = (customer_data_pd
  .set_index('customer_id')
  .join(store_transactions_pd.set_index('customer_id'))
)

(crm_data_pd
  .loc[(crm_data_pd['region'] == 'West') & (crm_data_pd['feb_2005'] == crm_data_pd['feb_2005'].max())]
  .assign(age = 2024 - crm_data_pd['birth_year'])
  .loc[:, ['age', 'feb_2005']]
  .sort_values('age', ascending = False)
  .iloc[0:1]
)
customer_data |> 
  left_join(store_transactions, join_by(customer_id)) |> 
  filter(region == "West", feb_2005 == max(feb_2005)) |> 
  mutate(age = 2024 - birth_year) |> 
  select(age, feb_2005) |> 
  arrange(desc(age)) |> 
  slice(1)

Summarize Discrete Data

(customer_data
  .group_by(pl.col('region'))
  .agg(n = pl.len())
)

(customer_data
  .group_by(pl.col(['region', 'college_degree']))
  .agg(n = pl.len())
)
(customer_data_pd
  .value_counts('region')
)

(customer_data_pd
  .value_counts(['region', 'college_degree'])
)
customer_data |> 
  count(region)

customer_data |> 
  count(region, college_degree)

Summarize Continuous Data

(customer_data
  .select(pl.col('income'))
  .mean()
)

(customer_data
  .select(pl.col(['income', 'credit']))
  .mean()
)
(customer_data_pd
  .loc[:, ['income']]
  .mean()
)

(customer_data_pd
  .loc[:, ['income', 'credit']]
  .mean()
)
customer_data |>
  summarize(avg_income = mean(income))

customer_data |>
  summarize(
    avg_income = mean(income),
    avg_credit = mean(credit)
  )

Summarize Discrete and Continuous Data

(customer_data
  .group_by(pl.col(['gender', 'region']))
  .agg(
    avg_income = pl.col('income').mean(), 
    avg_credit = pl.col('credit').mean()
  )
  .sort(pl.col('avg_income'), descending = True)
)
(customer_data_pd
 .groupby(['gender', 'region'])
 .agg(
   avg_income = ('income', 'mean'), 
   avg_credit = ('credit', 'mean')
  )
 .sort_values('avg_income', ascending = False)
)
customer_data |>
  group_by(gender, region) |>
  summarize(
    avg_income = mean(income),
    avg_credit = mean(credit)
  ) |> 
  arrange(desc(avg_income))

Don’t forget lazy evaluation!

Resources

Thank you!

marc.dotson@usu.edu
github.com/marcdotson
occasionaldivergences.com